
For best experience open the pre-rendered .html file
You have 2 options depending on how much you trust my code and how little time you want to dedicate to this review :)
Why you should NOT visualize this notebook on github
Some charts/diagrams/features are not visible in github.
This is standard and well-known behaviour.
Requirements for running this locally
python 3.9pip install -r requirements.txt before proceeding.For this dataset we want to explore different ways to predict a specific column "insurance policy" to assist our ficticious insurance company.
A few important things to remember about the dataset we are going to be using:
Based on all these attributes, we are trying to predict which customers decided to buy travel insurance.
The dataset is relatively small (2k rows) because we are a small travel agency.
Since the number of people who book their vacation through us is fairly small, we want to maximize our chances of selling travel insurance policies. This is a critical revenue stream for us and we cannot afford to leave money on the table.
This is why we will be looking at models performances, and in particular, we want to find models that have good recall performance.
We want the model to err on the side of "this person is likely to buy, suggest it to them" instead of "I am only flagging customers that I know for sure will buy this insurance".
One simple (non-ML) policy could be "suggest it to everyone", but that would be energy consuming. We don't want to suggest it to all of them and get a bad reputation as an aggressive travel agency, but we do want to end up with a model that has strong recall.
The ideal situation for us would be:
| did not offer insurance | offered insurance | |
|---|---|---|
| did not want insurance | 😐 no one cares | 🤷 okay... |
| wanted insurance | ⚠️ TERRIBLE worst mistake you could have ever made |
✅ GREAT |
So we want models with Perfect True Positive scores, and (more importantly) No False Negatives.
The other two possibilities (True negative and False positive) don't really affect us.
| did not offer insurance | offered insurance | |
|---|---|---|
| did not want insurance | ? | ? |
| wanted insurance | must be 0 | as high as possible |
We will try to create a custom loss function that prioritizes this goal. Let's see how that goes.
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import duckdb
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from tqdm import tqdm
import time
from collections import deque
from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
ConfusionMatrixDisplay,
accuracy_score,
recall_score,
precision_score,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix
from utils import *
from utils import __
from insurance import *
from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split
from sklearn.exceptions import DataConversionWarning
seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
loading insurance modules... ✅ completed configuring autoreload... ✅ completed ✅
Let's use black to auto-format all our cells so they adhere to PEP8
import lab_black
%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)
warnings.filterwarnings("ignore", category=FutureWarning)
We're ready to start! Let's download the dataset from Kaggle.
dataset_name = "tejashvi14/travel-insurance-prediction-data"
db_filename = "TravelInsurancePrediction.csv"
auto_kaggle.download_dataset(dataset_name, db_filename)
__
Kaggle API 1.5.13 - login as 'edualmas' File [dataset/TravelInsurancePrediction.csv] already exists locally! No need to re-download dataset [tejashvi14/travel-insurance-prediction-data]
raw_ = auto_pandas.from_csv(f"dataset/{db_filename}", read_csv_kws={"index_col": 0})
raw_.head()
| age | employment type | graduateornot | annualincome | familymembers | chronicdiseases | frequentflyer | evertravelledabroad | travelinsurance | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | Government Sector | Yes | 400000 | 6 | 1 | No | No | 0 |
| 1 | 31 | Private Sector/Self Employed | Yes | 1250000 | 7 | 0 | No | No | 0 |
| 2 | 34 | Private Sector/Self Employed | Yes | 500000 | 4 | 1 | No | No | 1 |
| 3 | 28 | Private Sector/Self Employed | Yes | 700000 | 3 | 1 | No | No | 0 |
| 4 | 28 | Private Sector/Self Employed | Yes | 700000 | 8 | 1 | Yes | No | 0 |
def bool_to_num(df: pd.DataFrame, col: str) -> pd.DataFrame:
df[f"{col}_number"] = -1
df.loc[df[col] == "Yes", f"{col}_number"] = 1
df.loc[df[col] == "No", f"{col}_number"] = 0
df = df.drop(columns=[col])
df = df.rename(columns={f"{col}_number": col})
return df
def employment_to_num(df: pd.DataFrame, col: str) -> pd.DataFrame:
df[f"{col}_number"] = -1
df.loc[df[col] == "Yes", f"{col}_number"] = 1
df.loc[df[col] == "No", f"{col}_number"] = 0
df = df.drop(columns=[col])
df = df.rename(columns={f"{col}_number": col})
return df
clean = raw_.copy()
clean = bool_to_num(clean, "graduateornot")
clean = bool_to_num(clean, "frequentflyer")
clean = bool_to_num(clean, "evertravelledabroad")
clean = pd.get_dummies(clean, drop_first=True, columns=["employment type"])
clean = clean.rename(
columns={"employment type_Private Sector/Self Employed": "employment_private"}
)
clean.head()
| age | annualincome | familymembers | chronicdiseases | travelinsurance | graduateornot | frequentflyer | evertravelledabroad | employment_private | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 400000 | 6 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 31 | 1250000 | 7 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 34 | 500000 | 4 | 1 | 1 | 1 | 0 | 0 | 1 |
| 3 | 28 | 700000 | 3 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4 | 28 | 700000 | 8 | 1 | 0 | 1 | 1 | 0 | 1 |
Let's check that all of the values were as expected:
cols = ["graduateornot", "frequentflyer", "evertravelledabroad"]
print("amount of unknowks in each of the classified cols:")
for col in cols:
print(col, clean[clean[col] == -1].shape[0])
amount of unknowks in each of the classified cols: graduateornot 0 frequentflyer 0 evertravelledabroad 0
clean = clean.rename(
columns={
"annualincome": "income_rupees",
"familymembers": "family_size",
"chronicdiseases": "has_chronic_disease",
"travelinsurance": "travel_insurance",
"graduateornot": "has_graduated",
"frequentflyer": "is_frequent_flyer",
"evertravelledabroad": "has_travelled_abroad",
}
)
Let's reorder the columns so the predicted variable is at the end of the df
clean.columns = [
"age",
"income_rupees",
"family_size",
"has_chronic_disease",
"has_graduated",
"is_frequent_flyer",
"has_travelled_abroad",
"employment_private",
"travel_insurance",
]
clean
| age | income_rupees | family_size | has_chronic_disease | has_graduated | is_frequent_flyer | has_travelled_abroad | employment_private | travel_insurance | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 400000 | 6 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 31 | 1250000 | 7 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 34 | 500000 | 4 | 1 | 1 | 1 | 0 | 0 | 1 |
| 3 | 28 | 700000 | 3 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4 | 28 | 700000 | 8 | 1 | 0 | 1 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1982 | 33 | 1500000 | 4 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1983 | 28 | 1750000 | 5 | 1 | 0 | 1 | 0 | 1 | 1 |
| 1984 | 28 | 1150000 | 6 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1985 | 34 | 1000000 | 6 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1986 | 34 | 500000 | 4 | 0 | 0 | 1 | 0 | 0 | 1 |
1987 rows × 9 columns
Let's also include a monetary amounts in EUR, to get a better understanding of magnitudes
# correct as of 2023/05/14
rupees_to_eur = 0.01118
clean["income_eur"] = clean["income_rupees"] * rupees_to_eur
sns.histplot(data=clean, x="income_eur", hue="travel_insurance", multiple="stack")
plt.title("annual income distribution in EUR")
plt.xlabel("income (converted to eur)")
plt.xlim(0)
clean = clean.drop(columns=["income_eur"])
clean.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 1987.0 | 29.650226 | 2.913308 | 25.0 | 28.0 | 29.0 | 32.0 | 35.0 |
| income_rupees | 1987.0 | 932762.959235 | 376855.684748 | 300000.0 | 600000.0 | 900000.0 | 1250000.0 | 1800000.0 |
| family_size | 1987.0 | 4.752894 | 1.60965 | 2.0 | 4.0 | 5.0 | 6.0 | 9.0 |
| has_chronic_disease | 1987.0 | 0.277806 | 0.44803 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| has_graduated | 1987.0 | 0.357323 | 0.479332 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| is_frequent_flyer | 1987.0 | 0.851535 | 0.35565 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| has_travelled_abroad | 1987.0 | 0.209864 | 0.407314 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| employment_private | 1987.0 | 0.191243 | 0.393379 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| travel_insurance | 1987.0 | 0.713135 | 0.452412 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
@run
@cached_chart()
def pairplot_with_corrs():
f = plt.figure(figsize=(8, 8))
return charts.pairplot_with_corr(clean)
Loading from cache [./cached/charts/pairplot_with_corrs.png]
This is the discovery of the century: People who have a large annual income have a strong correlation with:
clean.corr()
| age | income_rupees | family_size | has_chronic_disease | has_graduated | is_frequent_flyer | has_travelled_abroad | employment_private | travel_insurance | |
|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | -0.020101 | 0.027409 | 0.007359 | 0.061060 | 0.027125 | -0.033159 | -0.012779 | -0.115134 |
| income_rupees | -0.020101 | 1.000000 | -0.015367 | -0.001149 | 0.396763 | 0.108066 | 0.353087 | 0.486043 | 0.349157 |
| family_size | 0.027409 | -0.015367 | 1.000000 | 0.028209 | 0.079909 | 0.021201 | -0.023775 | -0.020755 | -0.003354 |
| has_chronic_disease | 0.007359 | -0.001149 | 0.028209 | 1.000000 | 0.018190 | 0.018811 | -0.043720 | 0.021238 | -0.011553 |
| has_graduated | 0.061060 | 0.396763 | 0.079909 | 0.018190 | 1.000000 | 0.018934 | 0.232103 | 0.433183 | 0.147847 |
| is_frequent_flyer | 0.027125 | 0.108066 | 0.021201 | 0.018811 | 0.018934 | 1.000000 | -0.028120 | 0.062683 | -0.127133 |
| has_travelled_abroad | -0.033159 | 0.353087 | -0.023775 | -0.043720 | 0.232103 | -0.028120 | 1.000000 | 0.277334 | 0.143790 |
| employment_private | -0.012779 | 0.486043 | -0.020755 | 0.021238 | 0.433183 | 0.062683 | 0.277334 | 1.000000 | 0.181098 |
| travel_insurance | -0.115134 | 0.349157 | -0.003354 | -0.011553 | 0.147847 | -0.127133 | 0.143790 | 0.181098 | 1.000000 |
@run
@cached_profile_report()
def profile_clean_data():
return ydata_utils.report(clean)
Loading from cache [./cached/profile_report/profile_clean_data.html]
This dataset is looking really good:
I think it's time to start modelling.
For this project we want to predict which clients are likely to purchase travel insurance.
Let's find the clients who will not be able to refuse our offers.

Here's a summary of the available algorithms, extracted directly from this fantastic talk by Brian Lange at PyData 2016
| Included in this analysis | Name | non linear | probability estimate | feature importance | updatable | parallelizable |
|---|---|---|---|---|---|---|
| ❌ | Logistic Regression | No | Yes | Yes, if you scale | maybe | maybe |
| ✅ | SVM | Yes, with kernel | No | No | depending on kernel | only for some kernels |
| ✅ | KNN | Yes | Yes, % of nearby points | No | yes | yes |
| ❌ | Naive Bayes | yes | yes | no | yes | yes |
| ✅ | Decision Trees | yes | no | kinda, number of times a feat is used | no | no (but it's very fast) |
| Random Forests ✅ Bagged KNNs ✅ |
Ensemble models | yes | kinda (% of models that agree) | kinda, depending on component parts (which models you chose) | kinda, by adding new models to the ensemble | yes |
| ❌ | Boosted models | yes | kinda (% of models that agree) | kinda, depending on component parts (which models you chose) | kinda, by adding new models to the ensemble | no |
The first model we want to try out is K-Nearest Neighbors.
performance: ModelPerformance = knn(clean, k=1, check_using="train")
performance.confusion_plot()
Not bad for such a simple and naive model. Let's see if the performance improves as we increase k
We don't need to worry about the train/test split, each of the functions will take care of that inside. All the functions will even use a fixed random seed to guarantee repeatable results.
df = clean
Let's compare KNN performance for different values of K, and for each of the 4 algorithms it supports.
We will compare performance differences and behaviour in 2 sceanarios:
ks = range(1, 100, 5)
algos = ["auto", "ball_tree", "kd_tree", "brute"]
datasets = ["train", "test"]
m = [knn(df, k, algo, dataset) for k in ks for algo in algos for dataset in datasets]
metrics_knn = metrics_as_dataframe(m)
metrics_knn
| check_using | K | algorithm | metric | score | |
|---|---|---|---|---|---|
| 0 | train | 1 | auto | accuracy | 0.976715 |
| 1 | test | 1 | auto | accuracy | 0.934673 |
| 2 | train | 1 | ball_tree | accuracy | 0.976715 |
| 3 | test | 1 | ball_tree | accuracy | 0.934673 |
| 4 | train | 1 | kd_tree | accuracy | 0.976715 |
| ... | ... | ... | ... | ... | ... |
| 635 | test | 96 | ball_tree | f1_score | 0.869427 |
| 636 | train | 96 | kd_tree | f1_score | 0.864128 |
| 637 | test | 96 | kd_tree | f1_score | 0.869427 |
| 638 | train | 96 | brute | f1_score | 0.864128 |
| 639 | test | 96 | brute | f1_score | 0.869427 |
640 rows × 5 columns
grid = sns.FacetGrid(data=metrics_knn, row="algorithm", col="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="K", y="score", hue="check_using")
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x200bbcb6a00>
I doubt that KNN is ever used with K > 20, for datasets like this one, but I still wanted to explore how performance degrades as K approaches the size of the entire dataset.
As expected, the best performance seem to appear:
The last interesting thing is that the choice of algorithm seems to not affect our results in a significant way.
Let's move unto something more interesting: Decision Trees
The parameter that trees use is the max. depth allowed (how many decisions in a row they are allowed to do). Too many decisions will likely lead to overfitting while too few will lead to underfitting, and losing potential performance on the table.
performance: ModelPerformance = decision_tree(df, depth=3)
performance.confusion_plot()
Not bad for a depth of 3.
Let's take a look at the type of decisions this tree is making to classify our data
plt.figure(figsize=(20, 6))
plot_tree(
performance.extra["model"],
filled=True,
feature_names=df.drop(columns="travel_insurance").columns,
)
plt.show()
Just from a quick look we can see that income is a major factor in this decision, since it appears in 3 different nodes.
Age is the next most influential attribute.
depth = range(1, 20)
datasets = ["train", "test"]
m = [decision_tree(df, d, check_using=c) for d in depth for c in datasets]
metrics_tree = metrics_as_dataframe(m)
metrics_tree
| check_using | depth | metric | score | |
|---|---|---|---|---|
| 0 | train | 1 | accuracy | 0.790434 |
| 1 | test | 1 | accuracy | 0.821608 |
| 2 | train | 2 | accuracy | 0.795469 |
| 3 | test | 2 | accuracy | 0.826633 |
| 4 | train | 3 | accuracy | 0.796098 |
| ... | ... | ... | ... | ... |
| 147 | test | 17 | f1_score | 0.970537 |
| 148 | train | 18 | f1_score | 0.986192 |
| 149 | test | 18 | f1_score | 0.970537 |
| 150 | train | 19 | f1_score | 0.986192 |
| 151 | test | 19 | f1_score | 0.970537 |
152 rows × 4 columns
grid = sns.FacetGrid(data=metrics_tree, col="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="depth", y="score", hue="check_using")
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x200bdaaea00>
It seems that the accuracy does improve a bit if we allow it to add more depth, but after a few layers it flattens out (as expected).
Again, just to reiterate: in production, we would never do what we are doing here (checking against test dataset to understand/tuning hyperparameter), but this analysis is just to check how the performances change with different combinations of the paramters, and these charts show clear evidence of overfitting once the depth goes beyond 6 or 7: the training performance keeps improving, beyond the test dataset. This can be seen in 3 of the 4 metrics inspected.
performance: ModelPerformance = random_forest(df, max_depth=3, n_estimators=100)
performance.confusion_plot()
Let's interrogate the random forest and try to extract the "consensus" around the importance of the features
importances = performance.extra["model"].feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = df.columns[indices]
plt.figure()
plt.title("Feature importance")
plt.bar(range(df.shape[1] - 1), importances[indices], color="r", align="center")
plt.xticks(range(df.shape[1] - 1), feature_names, rotation=90)
plt.xlim([-1, df.shape[1] - 1])
plt.show()
Oops! During our initial "let's try combinations" phase, things got out of hand (no need to blame any specific data developer!)
We need to be careful with caching hundreds of random forests!
def random_forests():
return {
(size, depth): random_forest(df, max_depth=depth, n_estimators=size)
for depth in range(1, 20)
for size in range(50, 1000, 50)
}
This resulted in 361 different random forests being cached to dist, occupying more than 3GB of space
Our cache on disk is not too pleased!

Our memory and swap space are also displeased about what we just did, and voiced their opinion -loudly- against doing it again.

Let's try again, but with less datapoints.
tree_depths = range(1, 8)
forest_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 60, 120]
datasets = ["train", "test"]
m = [
random_forest(df, max_depth=depth, n_estimators=size, check_using=c)
for depth in tree_depths
for size in forest_sizes
for c in datasets
]
metrics_forest = metrics_as_dataframe(m)
metrics_forest
| max_depth | check_using | n_estimators | metric | score | |
|---|---|---|---|---|---|
| 0 | 1 | train | 1 | accuracy | 0.790434 |
| 1 | 1 | test | 1 | accuracy | 0.821608 |
| 2 | 1 | train | 2 | accuracy | 0.790434 |
| 3 | 1 | test | 2 | accuracy | 0.821608 |
| 4 | 1 | train | 3 | accuracy | 0.785400 |
| ... | ... | ... | ... | ... | ... |
| 723 | 7 | test | 30 | f1_score | 0.939444 |
| 724 | 7 | train | 60 | f1_score | 0.944842 |
| 725 | 7 | test | 60 | f1_score | 0.940789 |
| 726 | 7 | train | 120 | f1_score | 0.942857 |
| 727 | 7 | test | 120 | f1_score | 0.939245 |
728 rows × 5 columns
grid = sns.FacetGrid(data=metrics_forest, col="max_depth", row="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="n_estimators", y="score", hue="check_using")
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x2010eff4340>
We can see that max tree depth does seem to influence accuracy and recall in a predictable way (more tree complexity = better performance), but it's forest size the one that most consistently guarantees performance.
m = metrics_forest[metrics_forest.check_using == "train"].drop(columns=["check_using"])
accuracy = m[m.metric == "accuracy"].drop(columns=["metric"])
precision = m[m.metric == "precision"].drop(columns=["metric"])
recall = m[m.metric == "recall"].drop(columns=["metric"])
f1 = m[m.metric == "f1_score"].drop(columns=["metric"])
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(1, 3, 1, projection="3d")
ax2 = fig.add_subplot(1, 3, 2, projection="3d")
ax3 = fig.add_subplot(1, 3, 3, projection="3d")
X, Y = np.meshgrid(tree_depths, forest_sizes)
acc = accuracy["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))
pre = precision["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))
rec = recall["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))
def render_frame(i):
ax1.clear()
ax1.set_title(f"Accuracy")
ax1.contour3D(X, Y, acc, 100, cmap=sns.color_palette("flare", as_cmap=True))
ax1.set_xlabel("tree max depth")
ax1.set_ylabel("forest_size")
ax1.view_init(20, i * 12)
ax1.set_zlim(0.80, 1)
ax2.clear()
ax2.set_title(f"Precision")
ax2.contour3D(X, Y, pre, 100, cmap=sns.color_palette("flare", as_cmap=True))
ax2.set_xlabel("tree max depth")
ax2.set_ylabel("forest_size")
ax2.view_init(20, i * 12)
ax2.set_zlim(0.80, 1)
ax3.clear()
ax3.set_title(f"Recall")
ax3.contour3D(X, Y, rec, 100, cmap=sns.color_palette("flare", as_cmap=True))
ax3.set_xlabel("tree max depth")
ax3.set_ylabel("forest_size")
ax3.view_init(20, i * 12)
ax3.set_zlim(0.80, 1)
ani = animation.FuncAnimation(fig, render_frame, frames=30)
HTML(ani.to_jshtml())
Based on this analysis, we can conclude that, while max tree depth does affect somewhat the performance, the factor that really influences accuracy, precision and recall is forest size (up to a certain point).
The ideal spot seems to be: large enough to maximize performance without wasting computing resources.
The other metric that might be important to monitor is the Precision/Recall balance... we can see that at 600 forest size, precision is still improving, but recall is degrading (probably due to overfitting?)
performance: ModelPerformance = svm(clean, kernel="poly", degree=2, scale=False)
performance.confusion_plot()
Dear Sample SVM,
We hearby crown you "Senior Marketing Director in charge of Sales"
Congrats on the well deserved promotion!
Best Regards,
The rest of the company
Okay... This is just a sample of a random SVM, but we do want to explore how they perform in general.
For this exploration of SVMs, we also want to capture duration (model initialization + training + test) to see if there is a noticeable difference between different algorithms
performance.extra["duration"]
0.04615460000000127
Trying to run SMV without scaling data has proven to be extremely slow (= multiple minutes for "linear w/ 1 degree" model), so we will not compare unscaled v/ scaled.
We'll just make a mental note to always scale the data.
We're going to skip "linear" kernel as it never seems to finish.
Sometimes, the data cannot be classified using a linear model
kernels = ["poly", "rbf", "sigmoid"]
degrees = [0, 1, 2, 3, 4, 5]
scaling = [True, False]
datasets = ["train", "test"]
m = [
svm(clean, kernel=k, degree=d, scale=s, check_using=c)
for k in kernels
for d in degrees
for s in scaling
for c in datasets
]
metrics_svm = metrics_as_dataframe(m)
metrics_svm
| check_using | degree | scale | kernel | metric | score | |
|---|---|---|---|---|---|---|
| 0 | train | 0 | True | poly | accuracy | 0.708622 |
| 1 | test | 0 | True | poly | accuracy | 0.731156 |
| 2 | train | 0 | False | poly | accuracy | 0.708622 |
| 3 | test | 0 | False | poly | accuracy | 0.731156 |
| 4 | train | 1 | True | poly | accuracy | 0.708622 |
| ... | ... | ... | ... | ... | ... | ... |
| 283 | test | 4 | False | sigmoid | f1_score | 0.723478 |
| 284 | train | 5 | True | sigmoid | f1_score | 0.762527 |
| 285 | test | 5 | True | sigmoid | f1_score | 0.786207 |
| 286 | train | 5 | False | sigmoid | f1_score | 0.718818 |
| 287 | test | 5 | False | sigmoid | f1_score | 0.723478 |
288 rows × 6 columns
sns.scatterplot()
grid = sns.FacetGrid(data=metrics_svm, col="kernel", row="metric", ylim=(0.5, 1))
grid.map_dataframe(
sns.scatterplot, x="degree", y="score", hue="check_using", style="scale"
)
grid.add_legend()
<seaborn.axisgrid.FacetGrid at 0x200afe63d00>
We can conclude a few things, from these charts:
Now that we have explored all of the individual "simple" models, let's recover the original objective we had.
Remember that we want to maximize accuracy, but above that, our primary goal is to have great recall.
Let's try to build a few hundred ensemble models with different configurations, and see how we can find the optimal configurations that yield the best results.
Let's create an Ensemble model that is composed of a bunch of KNN models (cheap to train, but comparatively slow to predict).
These are the tweakable parameters of this ensemble model:
Ideally, we'd like to play with all of them, but visualizing 5 metrics (4 for performance + 1 for duration) in a 4-dimensional space will be hard!
So, we might just to a small sample to decide which of the parameters impact the result the most.
performance: ModelPerformance = bagging_knn(
clean,
n_estimators=2,
n_neighbors=5,
max_samples=0.4,
max_features=2,
)
print(performance.extra["duration"])
performance.confusion_plot()
4.061390299999999
Not terrible, not great
n_estimators = [1, 5, 10, 15, 20, 25, 50]
n_neighbors = [1, 2, 3]
max_samples = [0.1, 0.3, 0.5, 0.8, 1.0]
max_features = [1, 2, 3, 5]
bag_knn_performances = {
(e, n, s, f): bagging_knn(
clean,
n_estimators=e,
n_neighbors=n,
max_samples=s,
max_features=f,
)
for e in n_estimators
for n in n_neighbors
for s in max_samples
for f in max_features
}
n_estimators = []
n_neighbors = []
max_samples = []
max_features = []
accuracies = []
precisions = []
recalls = []
f1s = []
durations = []
for model in bag_knn_performances.values():
n_estimators.append(model.params["n_estimators"])
n_neighbors.append(model.params["n_neighbors"])
max_samples.append(model.params["max_samples"])
max_features.append(model.params["max_features"])
accuracies.append(model.accuracy())
precisions.append(model.precision())
recalls.append(model.recall())
f1s.append(model.f1_score())
durations.append(model.extra["duration"])
performances = pd.DataFrame(
{
"n_estimators": n_estimators,
"n_neighbors": n_neighbors,
"max_samples": max_samples,
"max_features": max_features,
"accuracies": accuracies,
"precisions": precisions,
"recalls": recalls,
"f1s": f1s,
"durations": durations,
}
)
performances.sample(10)
| n_estimators | n_neighbors | max_samples | max_features | accuracies | precisions | recalls | f1s | durations | |
|---|---|---|---|---|---|---|---|---|---|
| 308 | 25.0 | 1.0 | 0.5 | 1.0 | 0.756281 | 0.750000 | 1.000000 | 0.857143 | 0.075150 |
| 75 | 5.0 | 1.0 | 0.8 | 5.0 | 0.886935 | 0.886792 | 0.969072 | 0.926108 | 0.032473 |
| 262 | 20.0 | 2.0 | 0.1 | 3.0 | 0.909548 | 0.899687 | 0.986254 | 0.940984 | 0.058104 |
| 245 | 20.0 | 1.0 | 0.3 | 2.0 | 0.889447 | 0.884735 | 0.975945 | 0.928105 | 0.065183 |
| 61 | 5.0 | 1.0 | 0.1 | 2.0 | 0.859296 | 0.863777 | 0.958763 | 0.908795 | 0.044321 |
| 217 | 15.0 | 2.0 | 1.0 | 2.0 | 0.902010 | 0.884146 | 0.996564 | 0.936995 | 0.059607 |
| 119 | 5.0 | 3.0 | 1.0 | 5.0 | 0.907035 | 0.909677 | 0.969072 | 0.938436 | 0.041478 |
| 329 | 25.0 | 2.0 | 0.5 | 2.0 | 0.821608 | 0.807263 | 0.993127 | 0.890601 | 0.091529 |
| 173 | 10.0 | 3.0 | 0.8 | 2.0 | 0.886935 | 0.868263 | 0.996564 | 0.928000 | 0.054871 |
| 27 | 1.0 | 2.0 | 0.3 | 5.0 | 0.889447 | 0.955720 | 0.890034 | 0.921708 | 0.006751 |
charts.pairplot_with_corr(performances)
<seaborn.axisgrid.PairGrid at 0x21a3e818610>
It seems that recall is somewhat correlated with n_estimators.
Increasing all of the other meta parameters does help (positive corr), but the number of estimators contributes the most to improving the overall model convergence unto the correct prediction.
That does not mean that we need to go overboard with this. Given the simple dataset we have, a ton of different combinations ended un with perfect recall.
sns.histplot(performances["recalls"])
<AxesSubplot: xlabel='recalls', ylabel='Count'>
top_recalls = performances[performances.recalls == 1.0].sort_values(
by=["accuracies"], ascending=False
)
top_recalls.head(10)
| n_estimators | n_neighbors | max_samples | max_features | accuracies | precisions | recalls | f1s | durations | |
|---|---|---|---|---|---|---|---|---|---|
| 237 | 15.0 | 3.0 | 1.0 | 2.0 | 0.891960 | 0.871257 | 1.0 | 0.931200 | 0.075590 |
| 293 | 20.0 | 3.0 | 0.8 | 2.0 | 0.879397 | 0.858407 | 1.0 | 0.923810 | 0.065442 |
| 297 | 20.0 | 3.0 | 1.0 | 2.0 | 0.874372 | 0.853372 | 1.0 | 0.920886 | 0.080150 |
| 382 | 50.0 | 2.0 | 0.1 | 3.0 | 0.859296 | 0.838617 | 1.0 | 0.912226 | 0.090328 |
| 357 | 25.0 | 3.0 | 1.0 | 2.0 | 0.856784 | 0.836207 | 1.0 | 0.910798 | 0.073806 |
| 65 | 5.0 | 1.0 | 0.3 | 2.0 | 0.851759 | 0.831429 | 1.0 | 0.907956 | 0.038535 |
| 397 | 50.0 | 2.0 | 1.0 | 2.0 | 0.851759 | 0.831429 | 1.0 | 0.907956 | 0.106347 |
| 353 | 25.0 | 3.0 | 0.8 | 2.0 | 0.851759 | 0.831429 | 1.0 | 0.907956 | 0.070790 |
| 417 | 50.0 | 3.0 | 1.0 | 2.0 | 0.846734 | 0.826705 | 1.0 | 0.905132 | 0.099445 |
| 393 | 50.0 | 2.0 | 0.8 | 2.0 | 0.834171 | 0.815126 | 1.0 | 0.898148 | 0.101280 |
sns.stripplot(
top_recalls,
x="n_estimators",
y="max_features",
jitter=0.1,
hue="accuracies",
)
plt.ylim(0)
(0.0, 3.1)
All of these dots represent a model with excellent recall (1.0).
We are trying to find if there are patterns that result in also great accuracy (dark dots).
After checking various combinations of data slicing and dicing (by setting X and Y axis to different metrics), this chart is the only one that seems to have some patterns. Unsurprisingly, it tells us what we already imagined: more models with access to more features, results in better accuracy (while keeping recall at the peak of 1.0)
When comparing the durations of these models, things dont seem to differ too much, but we're not going to generalise about speed: it's probable that with such a small dataset, the difference in speed is impossible to notice in a significant way (is 0.02 seconds really a decision maker, when we spend minutes, hours or days munging the data manually?)
Alright, so far we have explored some simple/composite models
But none of these comparisons have helped us make a decision about which way to go. We have found models with perfect recall, models with excellent precision, and models that always predicted "YES! Sell this person some travel insurance!"
But what we really want is to find an easy way to assess them all easily and see which one might work best, according to our criteria of "ideal performance"
Remember that our ideal performance was:
We want to create a custom "loss function"/scorer that encodes this definition of success. It should penalise false negatives more strongly than it rewards true positives.
It will also ignore the other metrics because we truly don't care about them... As in, we care about them indirectly only.
def missed_sales_are_painful(y_true, y_pred):
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
return (10 * fn) - tp
aggressive_salesman = make_scorer(missed_sales_are_painful, greater_is_better=False)
performance = cross_validation_custom_score(
clean, custom_scorer=aggressive_salesman, n_iter_search=10
)
best_params = performance.extra["best_params"]
best_params
{'C': 0.8802345194777873, 'degree': 0, 'kernel': 'poly'}
This certainly matches our expectation.
If you remember, we had already found this to be an excellent model that guarantees great recall.
Considering how expensive we made the recall errors, it's no surprise that it picks Poly-0-degrees for our custom loss function. The tiny variation in precision/accuracy are not really impacting the final result.

svm_performance = svm(clean, kernel=best_params["kernel"], degree=best_params["degree"])
svm_performance.confusion_plot()
The problem with this model is that it basically makes our policy "always suggest getting insurance".
The fundamental problem is that we have made our loss function too aggressive at punishing problems,... so we ended up with this Alec Baldwin's character in Glengarry Glen Ross:

It seems that our machine learning algorithm we have created a monster.
Let that be a lesson to us all. Models only see the data we present to them, optimization follow our loss function/utility function to the limit.
As it was expected, this scoring function forces our model to be too aggressive and we're worried that this might annoy some customers.
We want to give some penalty if our model is too annoying towards customers who are extremely unlikely to buy insurance.
We want to add a small penalty for those cases.
We will also change the greater_is_better so it's easier to understand our scoring
def missed_sales_are_painful_but_lets_not_annoy_customers(y_true, y_pred):
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
return (tp - (1 * fn)) - (2 * fp)
less_annoying_salesman = make_scorer(
missed_sales_are_painful_but_lets_not_annoy_customers, greater_is_better=True
)
This is the new company policy:
| did not offer insurance | offered insurance | |
|---|---|---|
| did not want insurance | 😐 no one cares | 🛑 let's avoid it if they clearly DON'T want it |
| wanted insurance | ⚠️ no longer "TERRIBLE" but still bad | ✅ GREAT |
performance = cross_validation_custom_score(
clean, custom_scorer=less_annoying_salesman, n_iter_search=10
)
best_params = performance.extra["best_params"]
best_params
{'C': 0.6179572471872472, 'degree': 0, 'kernel': 'rbf'}
svm_performance = svm(clean, kernel=best_params["kernel"], degree=best_params["degree"])
svm_performance.confusion_plot()
This is clearly better... Now our model is balancing offering insurance only when it's really/somewhat confident that they want it... and it's also not offering it in some circumstances.
We have 40 customers who no longer hate us (because we are not annoying them with insurances they didn't really want), and it only had a small cost (4 customers did want it, but we did not offer it).
Keeping 40 loyal customers returning year after year is definitely worth this tiny cost. 👏
Good Job Support Vector Classifier!

In this analysis, we've used a Kaggle dataset about travel insurance policies to compare and contrast various machine learning algorithms.
The algorithms covered are:
The dataset selected for this comparison analysis is almost perfect:
Finally, we have also explored cross validation and using custom loss functions to be able to quantify and rate models' performances according to what we wish to optimize.
Instead of doing like in the rest of the exercise to loop and iterate over various combinations of parameters, we used RandomizedSearchCV to do the heavy lifting for us (searching and selection of optimal paramters for the ideal support vector machine)
Unfortunately, our custom loss function was very strong at penalizing false negatives, so the selected model turned out to be a monster that recommends travel insurance to all of our clients, regardless of their attributes.
Once this was adjusted (by giving it a new score function to optimize), the model became a bit more manageable and only recommended insurance to clients it had a strong confidence they would be interested in.
Let this be a reminder that the system will optimize whatever it is that you rewards, intentionally or accidentally.